Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative fix for #3805 by creating a snapshot and using it #3827

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

zecke
Copy link
Contributor

@zecke zecke commented May 1, 2024

Introduce types.AlertSnapshot and types.AlertsSnapshot that keeps track of the original dispatch time and uses it for the Status() and Resolved() method. This makes sure the notification is consistent across retries and/or other delays.

Change the behavior to not zero EndsAt for firing alerts and update the acceptance tests.

@zecke
Copy link
Contributor Author

zecke commented May 1, 2024

@grobinson-grafana A full mr and separating the snapshot-ing from the behavior change. The failing test seems unrelated.

@grobinson-grafana
Copy link
Contributor

@grobinson-grafana A full mr and separating the snapshot-ing from the behavior change. The failing test seems unrelated.

Awesome! Yes that test flakes a lot. It needs to be fixed 😞

@zecke zecke force-pushed the freyth-fix-ends-at branch 2 times, most recently from bd37add to 933c3af Compare May 5, 2024 10:04
@grobinson-grafana
Copy link
Contributor

Thanks once again for opening this @zecke! I'm going to review it this week! 👍

notify/discord/discord.go Outdated Show resolved Hide resolved
notify/test/test.go Outdated Show resolved Hide resolved
types/types.go Outdated Show resolved Hide resolved
types/types.go Outdated Show resolved Hide resolved
}
}

func (a *AlertSnapshot) Resolved() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is what I was expecting 👍

types/types.go Outdated Show resolved Hide resolved
for _, a := range alerts {
v := a.Alert
// If the end timestamp is not reached yet, do not expose it.
func (as AlertsSnapshot) Less(i, j int) bool {
Copy link
Contributor

@grobinson-grafana grobinson-grafana May 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can share the Less function between AlertSlice and AlertsSnapshot? Then we can avoid duplication of the same code! 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized we do not need both AlertSlice and AlertsSnapshot. We can either a.) rename AlertSlice to AlertsSnapshot and change its definition to []*AlertSnapshot or b.) delete AlertSlice and make the following additional changes:

diff --git a/api/v2/api.go b/api/v2/api.go
index 6f034d25..958d1db9 100644
--- a/api/v2/api.go
+++ b/api/v2/api.go
@@ -290,7 +290,7 @@ func (api *API) getAlertsHandler(params alert_ops.GetAlertsParams) middleware.Re
 			continue
 		}
 
-		alert := AlertToOpenAPIAlert(a, api.getAlertStatus(a.Fingerprint()), receivers)
+		alert := AlertToOpenAPIAlert(types.NewAlertSnapshot(a, now), api.getAlertStatus(a.Fingerprint()), receivers)
 
 		res = append(res, alert)
 	}
diff --git a/api/v2/compat.go b/api/v2/compat.go
index 112bfc4f..4b17b368 100644
--- a/api/v2/compat.go
+++ b/api/v2/compat.go
@@ -117,7 +117,7 @@ func PostableSilenceToProto(s *open_api_models.PostableSilence) (*silencepb.Sile
 }
 
 // AlertToOpenAPIAlert converts internal alerts, alert types, and receivers to *open_api_models.GettableAlert.
-func AlertToOpenAPIAlert(alert *types.Alert, status types.AlertStatus, receivers []string) *open_api_models.GettableAlert {
+func AlertToOpenAPIAlert(alert *types.AlertSnapshot, status types.AlertStatus, receivers []string) *open_api_models.GettableAlert {
 	startsAt := strfmt.DateTime(alert.StartsAt)
 	updatedAt := strfmt.DateTime(alert.UpdatedAt)
 	endsAt := strfmt.DateTime(alert.EndsAt)
diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go
index 8bf7ad6a..4860a64e 100644
--- a/dispatch/dispatch.go
+++ b/dispatch/dispatch.go
@@ -199,7 +199,7 @@ func (d *Dispatcher) run(it provider.AlertIterator) {
 
 // AlertGroup represents how alerts exist within an aggrGroup.
 type AlertGroup struct {
-	Alerts   types.AlertSlice
+	Alerts   types.AlertsSnapshot
 	Labels   model.LabelSet
 	Receiver string
 }
@@ -241,7 +241,7 @@ func (d *Dispatcher) Groups(routeFilter func(*Route) bool, alertFilter func(*typ
 			}
 
 			alerts := ag.alerts.List()
-			filteredAlerts := make([]*types.Alert, 0, len(alerts))
+			filteredAlerts := make([]*types.AlertSnapshot, 0, len(alerts))
 			for _, a := range alerts {
 				if !alertFilter(a, now) {
 					continue
@@ -258,7 +258,7 @@ func (d *Dispatcher) Groups(routeFilter func(*Route) bool, alertFilter func(*typ
 					receivers[fp] = []string{receiver}
 				}
 
-				filteredAlerts = append(filteredAlerts, a)
+				filteredAlerts = append(filteredAlerts, types.NewAlertSnapshot(a, now))
 			}
 			if len(filteredAlerts) == 0 {
 				continue
@@ -501,10 +501,10 @@ func (ag *aggrGroup) flush(notify func(...*types.AlertSnapshot) bool) {
 	alertsSlice := types.SnapshotAlerts(ag.alerts.List(), time.Now())
 	sort.Stable(alertsSlice)
 
-	resolvedSlice := make(types.AlertSlice, 0, len(alertsSlice))
+	resolvedSlice := make(types.AlertsSnapshot, 0, len(alertsSlice))
 	for _, alert := range alertsSlice {
 		if alert.Resolved() {
-			resolvedSlice = append(resolvedSlice, &alert.Alert)
+			resolvedSlice = append(resolvedSlice, alert)
 		}
 	}
 
diff --git a/store/store.go b/store/store.go
index 5de0c3ee..0ae0e853 100644
--- a/store/store.go
+++ b/store/store.go
@@ -116,7 +116,7 @@ func (a *Alerts) Set(alert *types.Alert) error {
 
 // DeleteIfNotModified deletes the slice of Alerts from the store if not
 // modified.
-func (a *Alerts) DeleteIfNotModified(alerts types.AlertSlice) error {
+func (a *Alerts) DeleteIfNotModified(alerts types.AlertsSnapshot) error {
 	a.Lock()
 	defer a.Unlock()
 	for _, alert := range alerts {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. The implication is that all callers of Groups including the API will work on a snapshot.

@zecke zecke force-pushed the freyth-fix-ends-at branch 2 times, most recently from 351a58e to fb91517 Compare May 17, 2024 13:37
Copy link
Contributor

@grobinson-grafana grobinson-grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see comments, otherwise LGTM. @gotjosh please review too!

@@ -28,6 +28,8 @@ func TestClusterDeduplication(t *testing.T) {
t.Parallel()

conf := `
global:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to fix flaking tests? If so, let's move it to a separate PR and I will approve it 🙂 It sounds pedantic, but it makes it much easier to use git blame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am certainly old school in bringing change sets to a GitHub pull request. Let's split the PR into two pieces then. The first to introduce snapshots and then the second to change the EndsAt behavior.

The reason to introduce the resolve timeout is from here:
2f05568

Update the acceptance tests as the webhook now has a non-zero EndsAt
timestamp for firing alerts. The models.GettableAlert doesn't have a
`status` field which means the current test is unable to differentiate
between firing and resolved alerts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR to introduce the snapshot is #3882. I will update this PR to include the behavior change for the webhook.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK! I didn't see that, my mistake! 😞 I thought it was to fix some flaking tests that were unrelated to this work, but I misunderstood!

@@ -29,6 +29,8 @@ func TestInhibiting(t *testing.T) {
// gets resolved.

conf := `
global:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here 🙂

@@ -21,6 +21,10 @@ import (
. "github.com/prometheus/alertmanager/test/with_api_v2"
)

const (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here 🙂

@@ -84,6 +84,8 @@ func TestSilencing(t *testing.T) {
t.Parallel()

conf := `
global:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here 🙂

@@ -274,6 +274,8 @@ func TestSendAlertsToUTF8Route(t *testing.T) {
t.Parallel()

conf := `
global:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here 🙂

@zecke
Copy link
Contributor Author

zecke commented Sep 11, 2024

How can we make progress on this change? I am happy to help reviewing other people changes to increase the bench of reviewers...

@gotjosh gotjosh mentioned this pull request Oct 22, 2024
Make sure that the alerts passed to notifyFunc are consistent regardless
of retries and other delays.

Signed-off-by: Holger Hans Peter Freyther <[email protected]>
Change the AlertGroup to use the AlertsSnapshot and the Group call
to take the snapshot time. Make changes to accomodate this.

Signed-off-by: Holger Hans Peter Freyther <[email protected]>
This commit fixes a bug where firing alerts had zeroed EndsAt
timestamps in notifications. This bug was introduced in another
PR (prometheus#1686) to fix issues where alerts are resolved during a flush.

(Commit message from George Robinson)

Update the acceptance tests as the webhook now has a non-zero EndsAt
timestamp for firing alerts. The models.GettableAlert doesn't havea
`status` field which means the current test is unable to differentiate
between firing and resolved alerts.

Signed-off-by: Holger Hans Peter Freyther <[email protected]>
@gotjosh
Copy link
Member

gotjosh commented Oct 23, 2024

I rebased this PR with main to fix the conflict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants